The project

In this project, you will use R and apply exploratory data analysis techniques to examine relationships among one or more variables and to explore a specific dataset for distributions, outliers, and anomalies.

Exploratory Data Analysis (EDA) is the numerical and visual analysis of the characteristics of data and their relationships using formal methods and statistical strategies.

EDA can give us insights, which can lead to new questions and eventually to predictive models. It is an important “line of defense” against bad data and an opportunity to check whether your assumptions or intuitions about a dataset are being violated.

Introduction

This analysis explores a dataset of red wines [Cortez et al., 2009], originally built to model wine quality as reflected by the chemical properties of each wine. The dataset has 1599 records with 11 variables (chemical properties) plus the quality of the wine (from 0 to 10) as reported by professionals in the field. A friend with a degree in chemistry helped point me to chemical properties that may give a wine an unpleasant taste, and those hypotheses will guide my analysis.

Univariate Plots Section

Overview

To start, we will look at each variable separately to get an idea of what we are dealing with:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

We can see that the data are well formatted and, although some columns appear to have outliers, nothing looks out of the ordinary.

First, we remove the index column (X), which is not needed.
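Dropping the index column can be sketched as follows. This is a minimal illustration and not necessarily the code used for the report: the small `wine` data frame below is a stand-in built from the first three rows shown in the `str()` output above, with only two of the real columns.

```r
# Stand-in for the wine data frame (first three rows, two real columns + X)
wine <- data.frame(
  X             = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  quality       = c(5L, 5L, 5L)
)

# X only duplicates the row index, so remove it before the analysis
wine$X <- NULL

print(names(wine))
```

Assigning `NULL` to a column is one common base R way to remove it; `subset(wine, select = -X)` would be an equivalent alternative.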

Quality

We start with the quality variable:

Although grades from 0 to 10 are possible, the data only contain grades in the 3-8 range, with a peak at 5 and few examples at the extremes. Let's take a closer look:

Worst wines:

## [1] 63

Best wines:

## [1] 217

Only 18 wines received the highest grade from the judges, and low-quality wines are also poorly represented; we will come back to this later in the analysis.
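Counts like the two above can be obtained by filtering on the grade. The sketch below assumes that "worst" means a grade of 4 or lower and "best" a grade of 7 or higher; these cutoffs are my assumption, since the report does not state them, and the `quality` vector is made-up toy data rather than the real column.

```r
# Toy grades, made up for illustration (not the real dataset)
quality <- c(3L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 8L)

# Assumed cutoffs: "worst" = grade <= 4, "best" = grade >= 7
n_worst <- sum(quality <= 4)
n_best  <- sum(quality >= 7)

# Frequency of each grade, to see how thin the extremes are
grade_counts <- table(quality)

print(n_worst)
print(n_best)
print(grade_counts)
```

A frequency table like `grade_counts` is also a quick way to check how many samples received the top grade.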

Alcohol

Now we will look at alcohol content.

The most common alcohol content is around 9.4, and the distribution is quite irregular (perhaps bimodal). It may be worth creating subsets for the different wine qualities to take a closer look.

It is not very clear, because there are few samples of good wines, but better wines appear to contain more alcohol than poor ones. My guess is that better wines accumulate more alcohol because of their longer fermentation, but a regression analysis is needed to be more confident in this claim.

Residual sugar

Now we will look at the residual sugar our wines contain.

Since the distribution is heavy-tailed, we should use a finer scale on the x axis and more bins to see it better.

There is a peak around 2; let's look at that region.

In this interval, where most of the wines lie, the data look normally distributed. In the other regions we may find outliers with respect to wine quality, since very sweet wines tend to be considered poor.

Now let's go back to the heavy-tailed distribution; to handle it, we rescale the x axis with a logarithmic scale.

Much better: now we can see a small peak for values above 10.
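To illustrate why a logarithmic scale helps with a heavy right tail, here is a small base R sketch. The `sugar` values are made up to mimic the shape described above, and the bin widths are arbitrary choices for the example.

```r
# Made-up right-skewed sample standing in for residual sugar
sugar <- c(1.2, 1.6, 1.8, 1.9, 1.9, 2.0, 2.1, 2.2, 2.3, 2.5,
           2.6, 3.1, 5.5, 11.0, 15.5)

# On the raw scale most values pile up in the first bin
raw_hist <- hist(sugar, breaks = seq(0, 16, by = 4), plot = FALSE)

# log10 compresses the long right tail, so the bulk of the data
# is spread over several bins and the tail stays visible
log_sugar <- log10(sugar)
log_hist  <- hist(log_sugar, breaks = seq(0, 1.2, by = 0.2), plot = FALSE)

print(raw_hist$counts)
print(log_hist$counts)
```

With `plot = FALSE`, `hist()` only returns the bin counts, which makes it easy to compare the two binnings without drawing anything.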

Now let's look at residual sugar in the outlier wines:

The modes are at 2, but the poor wines have outliers many standard deviations above the mean (around 13), and the distributions are heavy-tailed.

Chlorides

Chlorides indicate how salty a wine is; an excess of them spoils the wine.

This distribution is heavy-tailed too, so we will apply the log transformation.

As can be seen, there is a large concentration between 0.07 and 0.09, with outliers to the left and to the right.

Let's see how they perform:

The low-salinity wines received high grades, which is interesting.

pH

Now we look at pH, which describes how acidic or basic the wine is on a scale from 0 to 14.

Here we see a normal, well-centered distribution. Let's look at how it relates to wine quality.

No significant difference between the wines is visible.

Density

Density depends on the alcohol and residual sugar content; let's see what this distribution looks like.

Nothing out of the ordinary here, but let's see how it relates to quality.

There is no significant separation between the distributions.

Citric acid

One of the main characteristics of a wine's flavor, and perhaps the most interesting one in the data.

The data have a very odd distribution, and it is not clear how to analyze them, but, as expected, this is a characteristic that differs between wines. Let's take a closer look between the peaks:

Now let's look at the concentration for good and poor wines separately:

For the poor wines the distribution is sparse and heavy-tailed, centered to the left, whereas for the good wines it is perhaps bimodal.

Sulphates

Next up is sulphates, added to wines as an antimicrobial and antioxidant.

The distribution has a long right tail, so - again - log-transforming the data might help fix this issue:

Now it looks more normally distributed, with a pronounced peak at around 0.53. The initial histogram exposes a few outliers that have more than 1 gram of sulphates per liter - let’s look at those:

Those wines did pretty well, scoring 6 and 7.

As usual, we’ll now examine concentration of sulphates in poor and excellent wines:

The bulk of poor wine samples seems to be more tightly packed, whereas excellent wine samples look more spread out, peaking at 0.45 and 0.4, respectively.

Fixed and volatile acidity

In this subsection, we’ll be looking at concentration of tartaric and acetic acids. The latter, at too high levels, can make a wine taste like vinegar.

We might benefit from a more granular histogram here:

Now it’s easier to see the peak value, which is around 6.5.

In the initial histogram, some outliers are immediately obvious. We’ll examine a few of them more closely:

Almost half of those are poor wine samples and the rest are of medium quality (5 and 6). To check if wine quality drops as tartaric acid concentration increases, we might want to compare this concentration in poor and excellent wines, so that we can draw some tentative conclusion:

Both the peak values seem equal to 7, although the distribution of poor wine samples is more right-skewed, whereas that of excellent wine samples is left-skewed.

We’ll analyze acetic acid the same way as we did tartaric acid.

This distribution has a long right tail, so we can proceed in two ways: just chop the tail off by applying the limit to the X axis, or log-transform our data. Let’s try to do both for a change and see what results we end up with.

We get almost the same peak value of about 0.28, although it’s a bit off to the left (by circa 0.01) in case of log10 transform.

Let’s now take a closer look at some of the outliers that contain over 0.9 grams of acetic acid per liter:

We can see those are poor to medium wines, which seems to be in line with the above statement from the dataset description that claims that higher concentrations of this acid lead to a pronounced taste of vinegar in a wine sample. Of course, to draw any conclusions, a more in-depth analysis is needed, which we’ll undertake in the next sections. For now, we’ll try to find out what concentrations of acetic acid are typical of poor and excellent wines.

Peak values are almost equal, although the figure seems to be a bit greater for poor wines (0.28 vs 0.26). Moreover, the distribution of poor wines has a longer right tail that extends beyond higher values than that of excellent wines.

Free, bound, and total (free + bound) sulfur dioxide

Here, we’ll be exploring SO2 levels in our wine samples, starting off with free SO2.

All the wine samples sit under the value of 100 (the maximum is 72), so we might want to zoom in a bit:

This distribution seems to peak at a value close to 30.

Let’s also examine the outlier situated far off to the right in the first histogram:

Turns out it’s a medium-quality wine sample, with a score of 6.

Time to compare how much free SO2 is contained in poor and excellent wines:

Looks like lower levels of free SO2 are more typical of poor wines, with the peak at 2 and the bulk of the data sitting between 2 and 37. For excellent wines, most wine samples fall in the 27-47 range, with the peak at 29. One more interesting thing: the poor wines distribution here is the only one so far that looks like an exponential one, which means higher levels of free SO2 are much more rarely observed in poor wines than low ones.

Since total SO2 = free SO2 + bound SO2, we can create a new variable for bound SO2 and analyze it separately, same as we did with free SO2.
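A minimal sketch of deriving the new variable is shown below. The four rows are the first values of the two SO2 columns from the `str()` output above; the column name `bound.sulfur.dioxide` is an assumption, since the report does not show the name actually used.

```r
# First four rows of the two SO2 columns, as shown in the str() output
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60)
)

# bound SO2 = total SO2 - free SO2 (column name is hypothetical)
wine$bound.sulfur.dioxide <-
  wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

print(summary(wine$bound.sulfur.dioxide))
```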

Let’s see some high-level information about our new variable:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   21.00   30.59   39.00  251.50

Now we can build a few histograms to better understand how it behaves.

We can clearly see an outlier containing over 250 mg/l of bound SO2 - let’s go microscopic on it:

This one turns out to be a good wine, with a score of 7.

Now we’ll apply some breaks and limits to the X axis to be able to see separate values more clearly:

The peak’s become more discernible - it’s about 82.

Moving on to comparing the levels of bound SO2 in poor and excellent wines.

The peaks for the poor wines and excellent wines distributions are about 104 and 76, respectively. The bulk of the poor wine samples falls in a wider range between 44 and 168, whereas it’s just between 56 and 112 for excellent wines. Based only on this quick visual comparison, we can tentatively say that excellent wines tend to contain less bound SO2 than poor wines do.

The last to be analyzed in this subsection is the total level of SO2, which, I assume, must strongly correlate with both the level of free SO2 and the level of bound SO2 since it’s just the sum of the two. Let’s see if the total SO2 histograms we build are very different from what we had for free and bound SO2.

This histogram also shows an outlier with a pretty high total level of SO2. I believe it’s the same wine sample that we looked at above, when dealing with either free or bound SO2, but let’s check it to be sure:

Indeed, it’s the very same wine sample, graded 7, that we singled out earlier when looking at bound SO2.

Same as with free and bound SO2, let’s build a more granular histogram to see more clearly how the values are distributed:

Judging by the histogram, the mode of this distribution is somewhere near 114.

What’s left for us is to explore the total levels of SO2 across poor and excellent wines:

The poor wines histogram peaks at 109 and then at 189, whereas the excellent wines histogram shows two distinct peaks situated fairly close to each other - at 99 and 119. Also, poor wine samples are more spread out across the X axis, and the poor wines distribution seems to have a left tail.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1599 wine samples with 11 attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and a final grade each sample received from the professional wine judges based on those attributes. I’m not counting the X variable here since, as mentioned above, it’s just a duplicate of the index, so I dropped it from the dataset before going on with the analysis.

What is/are the main feature(s) of interest in your dataset?

The main feature is quality, because this whole analysis is driven by the question “what influences the quality of red wine?”. In the next two sections, I’m going to focus on exploring the relationships between quality and other features and their combinations.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Based on the univariate analysis I’ve performed so far, I have a reason to believe features like alcohol content, chlorides, levels of SO2, fixed and volatile acidity might be more or less reliable indicators of wine quality, but it’s hard to say anything for sure until bivariate and multivariate analyses are carried out and feature relationships are explored in various ways.

Did you create any new variables from existing variables in the dataset?

Since I had data on both free SO2 and total SO2 in wines, I created a new variable called bound sulfur dioxide (SO2) by subtracting free SO2 from total SO2. I’m yet to analyze this variable closer in the next sections of my analysis, but for now it seems like higher-quality wine samples tend to contain slightly lower levels of bound SO2.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I didn’t have to clean anything or fill any gaps as this dataset is prepared in such a way that there’s no missing data in it.

As I mentioned above, at the beginning of my analysis, I got rid of the X variable since it was just a duplicate of the index and didn’t help me in any way.

All the remaining features in the dataset seem more or less normally distributed, but some of the distributions are positively skewed: residual sugar, chlorides, sulphates, volatile acidity. I log-transformed (log10) all of them to solve the issue of long tails. In fact, the residual sugar distribution turned out to have two peaks on the log scale: the main one around 2 and a much smaller one above 10.

One thing I noticed is that concentration of almost all the substances is given in grams per cubic decimeter (or grams per liter, which is the same thing, and I prefer this notation), with a notable exception of levels of SO2, which are given in milligrams per liter. Further down the road, it might be worthwhile to convert them to grams per liter to see if anything changes. Same story with density, which can later be converted from grams per cubic centimeter to grams per liter to see if that transformation brings anything new and unexpected to the analysis.
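One thing worth checking before doing such a conversion: a pure change of units is a linear rescaling, and Pearson correlation is unaffected by it, so the correlation figures would not change. The sketch below uses the first ten values of total SO2 and alcohol from the `str()` output above; the variable names are mine.

```r
# First ten values from the str() output above
total_so2_mg_l <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
alcohol        <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Convert milligrams per liter to grams per liter
total_so2_g_l <- total_so2_mg_l / 1000

# Pearson correlation before and after the unit change
r_mg <- cor(total_so2_mg_l, alcohol)
r_g  <- cor(total_so2_g_l, alcohol)

print(c(r_mg = r_mg, r_g = r_g))
```

What does change is the scale of any axis, histogram bin width, or regression slope that involves the converted variable.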

Bivariate Plots Section

In this section, relationships between pairs of features will be examined. One such relationship is correlation, and the quickest way to obtain pairwise correlations for the whole dataset is to use the ggpairs() function from the GGally library.

We can see a more or less pronounced (I defined the threshold to be abs(0.35)) correlation between the following pairs:

Positive:

Negative:

Our main variable, quality, is correlated the most with alcohol (0.436), density (-0.307), bound sulfur dioxide (-0.218), chlorides (-0.21), and volatile acidity (-0.195).

It would make sense to concentrate our efforts on studying the identified correlated pairs more carefully.
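The thresholding step can also be sketched without plotting. The example below swaps ggpairs() for base R `cor()` and keeps only the pairs whose absolute correlation reaches 0.35. The data frame is made up for illustration and has only three columns, so its numbers do not match the figures reported here.

```r
# Made-up mini data frame; the real analysis uses all wine columns
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8, 9.5),
  density = c(0.9978, 0.9970, 0.9968, 0.9962, 0.9955, 0.9948, 0.9940, 0.9980),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.39, 3.36, 3.35, 3.30)
)

corr      <- cor(wine)   # Pearson correlation matrix
threshold <- 0.35

# Keep each pair once (upper triangle) and apply the threshold
idx <- which(upper.tri(corr) & abs(corr) >= threshold, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)

print(strong_pairs)
```

Splitting `strong_pairs` by the sign of `r` then gives the positive and negative lists.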

Scatter plots for positively correlated features

This pair displays the strongest positive correlation in the dataset, as can be clearly seen in the scatter plot above: density tends to grow almost linearly as the level of residual sugar gets higher.

These two plots show the second and third most pronounced correlations discovered: density vs bound and total SO2. As bound SO2 is a part of the total SO2, strong correlations were to be expected in both cases, and the plots do support these expectations. One can’t help but notice that the two plots look very much alike. It’s because correlation figures are very close numerically: 0.53 for density vs total SO2 and 0.504 for density vs bound SO2. Besides, distributions of bound SO2 and total SO2 are also fairly similar.

Here we can also see a positive correlation and therefore a general upward trend: wine quality seems to improve given higher alcohol content. There are lots of exceptions, of course, but the plot doesn’t provide the whole picture of what’s going on, so we’ll need to examine this case more closely further down the road.

These are the scatter plots for the most weakly correlated pairs (given the threshold of abs(0.35)): residual sugar vs total and bound SO2. Like with density plots above, these are also hard to tell apart since the two levels of SO2 are similarly distributed and strongly correlated with each other. The main takeaway from these plots is that sweeter wines tend to contain higher levels of bound and total SO2.

Scatter plots for negatively correlated features

The scatter plot above shows the most pronounced negative correlation discovered - alcohol vs density: as alcohol content grows, wine density tends to almost linearly decrease.
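A line of best fit for such a pair can be obtained with `lm()`. The sketch below uses made-up alcohol and density values, not the dataset, and shows how the slope relates to the correlation coefficient in a simple linear regression.

```r
# Made-up values for illustration only (not taken from the dataset)
alcohol <- c(9.4, 9.5, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8)
density <- c(0.9978, 0.9980, 0.9970, 0.9968, 0.9962, 0.9955, 0.9948, 0.9940)

# Least-squares line: density = intercept + slope * alcohol
fit   <- lm(density ~ alcohol)
slope <- coef(fit)[["alcohol"]]

# In simple regression, slope = r * sd(y) / sd(x)
r <- cor(alcohol, density)

print(slope)
print(r)
```

The slope depends on the units of both variables, while `r` does not, so the steepness of two fitted lines is only comparable when the axes are on comparable scales.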

As can be understood from the slope of the line of best fit, the second strongest negative correlation is much weaker. In fact, it’s almost 1.75 times weaker than the strongest correlation. To be honest, this one was a surprise for me: after doing some research, I was expecting the concentration of residual sugar and alcohol content to be positively correlated - looks like I couldn’t be more wrong!

The two plots above display the relationships between alcohol and levels of bound and total SO2 and seem to reinforce an earlier discovered positive correlation between quality and alcohol content. The intuition behind this is as follows: since higher alcohol content tends to be found more often in higher-quality wines and a high level of any SO2 is bad for a wine (as it will show in the nose and taste of this wine), it may be assumed that wines containing more alcohol (and therefore often having a higher grade) tend to have lower levels of bound and total SO2, which is clearly the case here.

The two final plots are for the two most weakly correlated pairs (above the predefined threshold of abs(0.35)): fixed acidity vs pH (-0.426) and alcohol vs chlorides (-0.36). The first plot makes intuitive sense: pH measures how acidic or basic a wine is, 0 being very acidic and 14 being very basic. As fixed acidity of a wine increases, its pH level drops accordingly, meaning that this particular wine is closer to being acidic on the pH scale. The second plot was in line with my expectations: I assumed not many people would enjoy drinking a wine with a high level of chlorides, as it would be too salty, if that’s a suitable description for a wine. Therefore, wines featuring higher alcohol content (and having a higher grade) would need to contain little salt in them to be judged as being high-quality. This assumption seems to hold true in this case, if the plot is to be believed.

Box plots for quality

In this subsection, we’ll look at how quality varies with the rest of the features and try to find out if any feature allows definitely telling a good wine from a bad one.
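The per-grade summaries in this subsection have the layout that `by()` prints. A minimal sketch of how such a summary and the matching box plot statistics can be produced is shown below, using only the first ten rows from the `str()` output above, so the numbers differ from the full-dataset tables.

```r
# First ten rows of quality and fixed.acidity from the str() output
wine <- data.frame(
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Six-number summary of fixed acidity for each quality grade
print(by(wine$fixed.acidity, wine$quality, summary))

# Medians per grade, handy for spotting trends across grades
medians <- tapply(wine$fixed.acidity, wine$quality, median)
print(medians)

# Box plot statistics without drawing (row 3 of $stats holds the medians)
bp <- boxplot(fixed.acidity ~ quality, data = wine, plot = FALSE)
print(bp$stats[3, ])
```

Dropping `plot = FALSE` draws the box plot itself.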

Quality vs fixed acidity

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.225  12.600

The worst wines in the dataset (grade 3) have the highest minimum fixed acidity (6.7) and, together with grade 4, the lowest median (7.5). The median tends to rise with quality, peaking at grade 7 (8.8). At the same time, wine samples rated 3 or 4 have the two lowest maximum levels of fixed acidity - 11.6 and 12.5.

Quality vs volatile acidity

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

This box plot shows a clear downward trend in median volatile acidity: it starts at 0.845 for grade 3, falls with each grade, and levels off at 0.37 for grades 7 and 8. One thing to note here is that the worst wine samples have both the highest minimum (0.44) and the highest maximum (1.58) values of volatile acidity, whereas the best ones have the lowest maximum (0.85).

Quality vs citric acid

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

The median level of citric acid rises steadily with the grade, from 0.035 at grade 3 to 0.42 at grade 8. The best wine samples are the only ones with a nonzero minimum (0.03), and the worst ones have the lowest maximum (0.66), followed by the best ones (0.72).

Quality vs residual sugar

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

The median level of residual sugar barely changes across the wine grades, the highest being 2.3 (grade 7) and the lowest 2.1 (grades 3, 4, and 8). The sweetest wine sample in the dataset (15.5) has a grade of 5. The best wine samples have the highest minimum (1.4) and the second-lowest maximum (6.4) levels of residual sugar; only the worst wines have a lower maximum (5.7).

Quality vs chlorides

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

Wine samples graded 3 have the highest median level of chlorides (0.0905); after that, the median generally goes downward and hits the bottom at grade 8 - 0.0705. The best wine samples also have the lowest maximum level of chlorides, which is 0.086 - it’s about 3 times lower than the runner-up (0.267) and about 7 times lower than the greatest value in the dataset (0.611).

Quality vs free sulfur dioxide

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0    11.0    14.5    34.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   12.26   15.00   41.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   15.00   16.98   23.00   68.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   15.71   21.00   72.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   14.05   18.00   54.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00    7.50   13.28   16.50   42.00

Wine samples with the grade of 3 have the lowest median level of free SO2 (6) and the lowest maximum (34), while the median peaks at grade 5 (15) and the greatest maximum (72) belongs to grade 6. Most values lie well below the threshold of 50 (the largest third quartile is only 23), above which SO2 might become evident in the nose and taste of wine and influence its quality.

Quality vs bound sulfur dioxide

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    6.75   11.00   13.90   13.75   37.00 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    8.00   14.00   23.98   32.00  107.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   14.00   29.00   39.53   58.00  128.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   11.00   19.00   25.16   33.00  126.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    8.50   15.00   20.97   21.50  251.50 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00    9.25   11.00   20.17   22.75   76.00

The median bound SO2 level peaks at grade 5 (29) and then drops as quality improves, reaching 11 at grade 8, which seems to be in line with the negative correlation (about -0.2) we’ve discovered earlier. Note, however, that the worst wines (grade 3) have an equally low median of 11. The highest maximum level of bound SO2 (251.5) belongs to grade 7, whereas the worst wines have the lowest maximum, 37 - roughly half that of the best ones (76).

Quality vs total sulfur dioxide

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

Since total SO2 = free SO2 + bound SO2, we can observe much the same patterns as above, when we analyzed levels of free and bound SO2 in wines. The median level of total SO2 peaks at grade 5 (47) and then declines through grades 6, 7 and 8 (35, 27 and 21.5). The lowest median of all, 15, belongs to grade 3, the worst wines in the dataset.
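
The per-grade tables in this section have the shape of `by()` output with `summary` applied to one column at a time. Below is a minimal sketch of how such tables, and the bound SO2 column itself, could be produced. The column name `bound.sulfur.dioxide` is an assumption made for illustration, and the tiny data frame `wine_toy` holds made-up values rather than rows from the real `wine` data.

```r
# Toy stand-in for the wine data frame. The values are made up for
# illustration and are NOT taken from the real dataset.
wine_toy <- data.frame(
  quality              = c(5, 5, 5, 6, 6, 6),
  free.sulfur.dioxide  = c(10, 15, 20, 8, 12, 16),
  total.sulfur.dioxide = c(40, 60, 80, 30, 45, 50)
)

# Bound SO2 is what remains of total SO2 once the free fraction is removed.
# The column name is hypothetical.
wine_toy$bound.sulfur.dioxide <-
  wine_toy$total.sulfur.dioxide - wine_toy$free.sulfur.dioxide

# One six-number summary per quality grade, printed in the same layout
# as the tables in this section.
bound_by_grade <- by(wine_toy$bound.sulfur.dioxide, wine_toy$quality, summary)
print(bound_by_grade)

# Medians only, as a named vector indexed by grade.
bound_medians <- tapply(wine_toy$bound.sulfur.dioxide, wine_toy$quality, median)
print(bound_medians)
```

With the real data, replacing `wine_toy` with `wine` should be the only change needed, provided the column names match.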

Quality vs density

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9961  0.9976  0.9975  0.9988  1.0008 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0031 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0037 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0032 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

The median density tends to decrease as the quality grows, from 0.9976 at grade 3 to 0.9949 at grade 8. The only group that breaks this trend is grade 5, whose median (0.9970) is higher than that of grade 4 (0.9965). This finding seems to be in line with what we’ve discovered previously: to refresh our memory, quality vs density is the strongest negative correlation for our main variable (-0.307).

Quality vs pH

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.163   3.230   3.267   3.350   3.720

The interquartile ranges of all grades lie between roughly 3.16 and 3.5 on the pH scale, and the median values fit into an even narrower range of 3.23-3.39. The medians drift downward as quality grows, from 3.39 at grade 3 to 3.23 at grade 8, with a small uptick at grade 6 (3.32 against 3.30 for grade 5).

Quality vs sulphates

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

The median values sit between 0.545 and 0.74 and grow with quality, from 0.545 at grade 3 to 0.74 at grades 7 and 8. The worst and best wine samples have the lowest maximum levels of sulphates (0.86 and 1.10), whereas grades 4 to 6 each contain samples with values close to 2.

Quality vs alcohol

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

This box plot reinforces our earlier finding that alcohol content and wine quality are positively correlated (about 0.48, the square root of the 0.227 R-squared that the alcohol-only model reaches in the modelling section). It turns out this is especially true for wine samples of higher grades: the median rises steadily from 9.7 at grade 5 to 12.15 at grade 8, whereas across grades 3 to 5 it stays almost flat (9.925, 10.00 and 9.7). The best wines have the highest median level of alcohol, which is noticeably different from the medians of lower-quality wines. In fact, grade 8 is the only grade whose median alcohol level exceeds 12, so the median can serve as a rough indicator of wine quality. Of course, such conclusions are restricted to our dataset only; the situation might be quite different for the whole population of wines.

Color-coded density plots

Let’s also build a few color-coded density plots for some of the features that formed the most strongly correlated pairs.

In all these plots, we can clearly see a bimodal distribution for the best wines. I guess this effect is due to there being very few wine samples with the top grade of 8 in the dataset, which take on just a handful of values. Our analysis might have benefited from a greater number of highest-quality wines, as we could’ve checked whether this pronounced bimodality has to do with insufficient data or whether there are some other factors at play.

As for the last density plot for residual sugar, the distributions seem quite skewed - and indeed, in the first section of this analysis, we’ve found out that the residual sugar distribution has a very heavy right tail. Let’s now try rebuilding the same density plot, but with the residual sugar variable log-transformed.

Now it becomes obvious that this distribution is actually bimodal across all wine grades! A pretty curious finding that I can’t explain right away for lack of domain knowledge. It might even be a phenomenon peculiar to Portuguese wines; it’s really hard to tell without having more data handy.
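
To illustrate why the log transform helps with a heavy right tail, here is a minimal base R sketch. It uses simulated log-normal values as a stand-in for residual sugar (they are not drawn from the real dataset), and the helper `skew_gap` as well as the simulated `grade` labels are names introduced here purely for illustration.

```r
# Simulated right-skewed values standing in for residual sugar.
# They are NOT taken from the real dataset.
set.seed(1)
sugar <- rlnorm(500, meanlog = 1, sdlog = 0.8)
grade <- sample(c(5, 6), size = 500, replace = TRUE)  # hypothetical grades

# Kernel density estimates on the raw and on the log10 scale.
dens_raw <- density(sugar)
dens_log <- density(log10(sugar))

# Rough skew indicator: distance between mean and median in units of sd.
skew_gap <- function(x) (mean(x) - median(x)) / sd(x)
gap_raw <- skew_gap(sugar)
gap_log <- skew_gap(log10(sugar))

# One density estimate per grade, the numeric counterpart of a
# color-coded density plot.
dens_by_grade <- lapply(split(log10(sugar), grade), density)

# Draw the two overall curves side by side when run interactively.
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(dens_raw, main = "Raw scale")
  plot(dens_log, main = "log10 scale")
  par(op)
}
```

On the log scale the mean and median sit much closer together, which is what makes features such as a second mode easier to spot.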

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The strongest positive correlation involving quality is quality vs alcohol (about 0.48, which matches the 0.227 R-squared of the alcohol-only linear model). One particularly interesting thing here is that the upward trend (quality increases as alcohol content grows) holds only for higher-quality wines, starting from grade 5: the median alcohol level climbs from 9.7 at grade 5 to 12.15 at grade 8, while for wine samples graded 3-5 it stays almost flat and even dips slightly at grade 5. Only grade 8 has a median alcohol level above 12, which might help us tell a good wine sample from a poor one.

The most pronounced negative correlation that has to do with our main feature is observed in the pair quality vs density (-0.307). The general trend there is a downward one: with each grade, median density decreases a bit, with one notable exception. Wine samples of grade 5 break this trend and have a higher median density than those of grade 4. Exactly the same picture can be seen in quality vs bound SO2 (-0.218): grade 5 wines once again break the generally downward trend.

Another interesting pattern was discovered in the pair quality vs volatile acidity (-0.195): median values there seemed to change in a wave-like fashion from one grade to another, going up and down a few times.

One more curious finding was that the residual sugar distribution, which is highly skewed initially, turns out to be bimodal across all the wine grades, from lowest to highest, once it is log-transformed and color-coded by quality. As I said above, under the relevant plot, I might be lacking some specialist knowledge to draw the right conclusion from this fact, or it might just be a peculiarity of Portuguese wines, red ones in particular.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fun fact: positive correlations were dominated by density (3 occurrences out of 6), while negative ones were dominated even more heavily by alcohol (5 occurrences out of 6). Therefore, it’s only natural that these two features produced the most highly correlated pairs (which I discuss in more detail in the subsection below), and density had a part in both of them!

Among other things, total SO2 and bound SO2 turned out to be positively correlated with both density and residual sugar. As for negative correlations, the strongest relationships were observed in the following pairs: total SO2 and free SO2 vs alcohol; pH vs fixed acidity; alcohol vs residual sugar and chlorides.

What was the strongest relationship you found?

Surprisingly enough, the two most pronounced correlations didn’t involve the main variable, quality, but instead featured density, which seems to be heavily dependent on both residual sugar and alcohol content. In the former case, the correlation is positive and equals 0.839; in the latter case, the features are negatively correlated (-0.78).
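
A ranking of pairwise correlations like the one discussed here can be obtained by flattening the correlation matrix and sorting by absolute value. The sketch below is one way to do it in base R, assuming Pearson correlation; the data frame `toy` contains made-up values, not rows from the real dataset.

```r
# Made-up values for illustration only, NOT taken from the real dataset.
toy <- data.frame(
  density        = c(0.990, 0.992, 0.994, 0.996, 0.998, 1.000),
  residual.sugar = c(1.0, 2.5, 3.0, 5.5, 6.0, 8.0),
  alcohol        = c(13.0, 12.1, 11.8, 10.4, 10.1, 9.0),
  quality        = c(6, 5, 7, 5, 6, 5)
)

# Pearson correlation matrix (the default method of cor()).
cors <- cor(toy)

# Keep each pair once: blank out the diagonal and the upper triangle.
cors[upper.tri(cors, diag = TRUE)] <- NA

# Flatten to one row per pair and sort by absolute strength.
ranked <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
names(ranked) <- c("feature_1", "feature_2", "r")
ranked <- ranked[!is.na(ranked$r), ]
ranked <- ranked[order(abs(ranked$r), decreasing = TRUE), ]

print(ranked, row.names = FALSE)
```

Sorting by `abs(r)` puts strong negative and strong positive pairs in the same list, which makes it easy to see whether the main variable appears near the top.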

Multivariate Plots Section

In the previous section, we used box plots to see how different variables are distributed across wine grades and scatter plots to discover interesting pairwise relationships between the features. This section allows us to take our analysis one step further by combining the two techniques and examining what relationships the features display (and how these relationships vary) across wine grades.

Scatter plots faceted by quality

Let’s first take a look at a couple of scatter plots for the features that exhibited the strongest correlation, faceted by quality.

Looks like no surprises here. Scatter plots demonstrate the same trends across all wine quality grades: upward for density vs residual sugar and downward for density vs alcohol.

I wonder what plots would look like for less correlated features.

For the lowest-quality wines, alcohol doesn’t seem to be correlated with residual sugar at all, with a negative trend becoming more noticeable towards higher wine grades.

A somewhat similar picture here. In the case of the worst and best wines, alcohol and total SO2 are much less correlated (if correlated at all) than in wine samples of other grades, which all display a more prominent downward trend.

This time the weakest correlation between the features takes place with the best wine samples. In all other cases, an upward trend is obvious.
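
One way to put numbers on per-grade trends like these is to compute the correlation separately within each quality grade. The following is a minimal sketch under the assumption that Pearson correlation is an adequate summary; the data frame `toy` holds made-up values and only two grades.

```r
# Made-up values for illustration only, NOT taken from the real dataset.
toy <- data.frame(
  quality = rep(c(5, 6), each = 4),
  alcohol = c(9.0, 9.5, 10.0, 10.5, 10.0, 11.0, 12.0, 13.0),
  density = c(0.9990, 0.9980, 0.9975, 0.9960, 0.9985, 0.9970, 0.9950, 0.9940)
)

# Split the rows by grade and compute one correlation per group.
r_by_grade <- sapply(
  split(toy, toy$quality),
  function(g) cor(g$alcohol, g$density)
)

# Group sizes matter: a correlation based on a handful of samples is unstable.
n_by_grade <- table(toy$quality)

print(round(r_by_grade, 3))
print(n_by_grade)
```

Reporting the group sizes next to the coefficients helps judge whether a weak correlation in the best or worst grade reflects the wines or simply the small number of samples.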

Building a simple linear model

We’ll now build a pretty straightforward linear model to see how well it can predict wine quality based on the features we’ve analyzed.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + residual.sugar, data = wine)
## m3: lm(formula = quality ~ alcohol + residual.sugar + density, data = wine)
## m4: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity, 
##     data = wine)
## m5: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity + 
##     pH, data = wine)
## m6: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity + 
##     pH + sulphates, data = wine)
## m7: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity + 
##     pH + sulphates + free.sulfur.dioxide, data = wine)
## 
## =========================================================================================================================
##                             m1            m2            m3            m4            m5            m6            m7       
## -------------------------------------------------------------------------------------------------------------------------
##   (Intercept)              1.875***      1.882***    -42.884***    -24.273*      -13.811        -0.150         2.280     
##                           (0.175)       (0.176)      (12.051)      (11.433)      (11.858)      (11.944)      (12.107)    
##   alcohol                  0.361***      0.361***      0.401***      0.339***      0.346***      0.325***      0.320***  
##                           (0.017)       (0.017)       (0.020)       (0.019)       (0.019)       (0.019)       (0.020)    
##   residual.sugar                        -0.004        -0.026        -0.016        -0.015        -0.007        -0.003     
##                                         (0.013)       (0.014)       (0.013)       (0.013)       (0.013)       (0.013)    
##   density                                             44.547***     27.216*       17.881         3.630         1.209     
##                                                      (11.990)      (11.367)      (11.702)      (11.812)      (11.975)    
##   volatile.acidity                                                  -1.359***     -1.272***     -1.154***     -1.160***  
##                                                                     (0.096)       (0.099)       (0.100)       (0.100)    
##   pH                                                                              -0.383**      -0.303*       -0.290*    
##                                                                                   (0.119)       (0.119)       (0.119)    
##   sulphates                                                                                      0.628***      0.642***  
##                                                                                                 (0.104)       (0.105)    
##   free.sulfur.dioxide                                                                                         -0.002     
##                                                                                                               (0.002)    
## -------------------------------------------------------------------------------------------------------------------------
##   R-squared                0.227         0.227         0.233         0.319         0.324         0.339         0.340     
##   adj. R-squared           0.226         0.226         0.232         0.318         0.322         0.336         0.337     
##   sigma                    0.710         0.711         0.708         0.667         0.665         0.658         0.658     
##   F                      468.267       234.040       161.879       187.064       152.580       136.047       116.861     
##   p                        0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood       -1721.057     -1721.016     -1714.127     -1618.932     -1613.786     -1595.704     -1594.954     
##   Deviance               805.870       805.829       798.915       709.235       704.685       688.926       688.280     
##   AIC                   3448.114      3450.031      3438.254      3249.864      3241.573      3207.408      3207.908     
##   BIC                   3464.245      3471.540      3465.139      3282.127      3279.213      3250.425      3256.302     
##   N                     1599          1599          1599          1599          1599          1599          1599         
## =========================================================================================================================

The variables in the largest of these linear models (m7) account for 34% of the variance in the quality of red wine (R-squared of 0.340, adjusted 0.337).
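
The table above appears to come from a model-comparison helper; the same kind of nested comparison can be sketched with base R alone. The example below uses simulated data (the coefficients and noise level are arbitrary assumptions, not estimates from the real dataset) and only two predictors to keep it short.

```r
# Simulated data for illustration only; the coefficients below are arbitrary.
set.seed(42)
n <- 300
sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.0),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
sim$quality <- 2 + 0.35 * sim$alcohol - 1.2 * sim$volatile.acidity +
  rnorm(n, sd = 0.7)

# Nested models: each one adds a predictor to the previous formula.
m1 <- lm(quality ~ alcohol, data = sim)
m2 <- update(m1, . ~ . + volatile.acidity)

# Collect the fit statistics that the comparison table reports.
fit_stats <- sapply(list(m1 = m1, m2 = m2), function(m) {
  s <- summary(m)
  c(r.squared     = s$r.squared,
    adj.r.squared = s$adj.r.squared,
    AIC           = AIC(m),
    BIC           = BIC(m))
})
print(round(fit_stats, 3))

# F-test for whether the extra predictor improves the fit.
nested_test <- anova(m1, m2)
print(nested_test)
```

R-squared never decreases when a predictor is added to a nested least-squares model, so adjusted R-squared, AIC and BIC are the more informative columns when deciding whether an extra feature earns its place.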

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The most prominent correlations we’ve discovered were in fact so strong that, when faceted by wine quality, the features displayed the same trends across all wine grades: for density vs residual sugar, the trend was always upward, for density vs alcohol always downward.

For other, less correlated features (alcohol vs residual sugar, alcohol vs total SO2, density vs total SO2), the trend across the wine grades was also the same, with the exception of the best or worst wines, or both, where the features showed little to no correlation.

Were there any interesting or surprising interactions between features?

Since the correlation between density and residual sugar was noticeably stronger than that between density and alcohol (0.839 vs -0.78), I was especially interested to see how residual sugar and alcohol were correlated and expected at least a slightly positive correlation. To my surprise, the correlation turned out to be strongly negative (-0.451, second strongest among the negative correlations discovered); in fact, it was so strong that a downward trend manifested itself across 5 of the 6 wine grades represented in the dataset, the exception being grade 3, where the features showed no correlation at all.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I did create a linear model that makes a prediction based on 7 features from the dataset. Further increasing the number of features didn’t yield any significant improvement, so I stopped at this value. Surprisingly enough, the model explains a mere 34% of the variance in the target variable, which is quality. It seems that wine quality is not well explained by its physico-chemical properties alone. Two things to note here: first, the quality of prediction could be improved with more data (right now, there are only 1,599 samples); second, there are some other factors at play, so the model might have benefited from the addition of such variables as the price of the wine, the region where it was produced, the year it was produced and other things not related to wine chemistry. Trying out other models may also lead to better results. Say, I have a hunch that tree-based methods would do well in this case.


Final Plots and Summary

Plot One

Description One

This box plot supports our finding that the strongest positive correlation involving our main variable of interest is quality vs alcohol (about 0.48, consistent with the 0.227 R-squared of the alcohol-only model). An interesting thing here is that for the lower wine grades there is no upward trend at all: the median alcohol content stays almost flat across grades 3 to 5 and even dips slightly at grade 5. Only from grade 5 onwards does the median alcohol content grow steadily with each grade.

Moreover, the median (and mean as well) alcohol content of the best wines looks clearly different from that of the worst wines (medians of 12.15 and 9.925 for grades 8 and 3), which can be used to more or less reliably tell a quality wine from a poor one.

Plot Two

Description Two

When plotted unmodified, the residual sugar distribution is highly skewed and has a long right tail. However, when log-transformed, the distribution becomes bimodal. When I later color-coded the plot, I saw the distribution was in fact bimodal across all the wine grades. Intrigued by this phenomenon, I read a few specialized articles on residual sugar in wines, but couldn’t find any explanation that would satisfy me. Therefore I’m inclined to think, for the lack of proof to the contrary, that it’s just a regional thing specific to Portuguese wines.

Plot Three

Description Three

This faceted scatter plot illustrates the third strongest negative correlation discovered during the analysis: alcohol vs total SO2. Each subplot contains a line of best fit that visually reinforces the trend across wine grades. One interesting observation here is that for the best and worst wines (grades 8 and 3) the features display little to no correlation, whereas for wines of grades 4 through 7 a clear downward trend manifests itself. It might be an indication that this particular combination of features is a bad candidate for predicting wine quality. Indeed, when I was building a linear model, alcohol turned out to be the best contributor to the overall quality of prediction, whereas total SO2 added nothing to improve it and therefore was not included in the resulting model.


Reflection

The dataset I’ve analyzed contains information on 1,599 red wines across 11 variables plus the output variable based on sensory data, that is, a grade on a scale of 0 to 10 given to each wine sample by professional wine judges. This dataset is restricted to Portuguese wines and contains only their physico-chemical properties.

I began my analysis by building histograms of each feature to understand their distribution. Most of them turned out to be roughly normally distributed, with a few notable exceptions (take residual sugar as an example), where I observed heavy skew and long tails. Log-transforming these variables helped me deal with this skew. I also defined thresholds for poor (grade 4 and under) and excellent (grade 8 and over) wines, then subset my dataset using these thresholds and plotted distributions of individual features across poor and excellent wines side by side. This helped me see whether these distributions were very different and identify a few potential candidates that could be useful in telling a low-quality wine from a better one.
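
The thresholding step described above can be sketched with `cut()`. The labels (`poor`, `average`, `excellent`) and the column name `rating` are assumptions made for this illustration, and the data frame `toy` holds made-up values.

```r
# Made-up values for illustration only, NOT taken from the real dataset.
toy <- data.frame(
  quality = c(3, 4, 5, 6, 7, 8),
  alcohol = c(9.9, 10.0, 9.7, 10.5, 11.5, 12.2)
)

# Grades 4 and under are "poor", grades 8 and over are "excellent".
# cut() uses right-closed intervals by default: (-Inf, 4], (4, 7], (7, Inf).
toy$rating <- cut(
  toy$quality,
  breaks = c(-Inf, 4, 7, Inf),
  labels = c("poor", "average", "excellent")
)

# Subsets for side-by-side comparison of a feature across the two extremes.
poor      <- subset(toy, rating == "poor")
excellent <- subset(toy, rating == "excellent")

print(table(toy$rating))
print(c(poor = median(poor$alcohol), excellent = median(excellent$alcohol)))
```

Keeping the rating as a factor column, rather than creating two separate data frames up front, makes it easy to reuse the same grouping for plots and summaries.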

I went on to explore pairwise relationships between the features and pick out the most strongly correlated (both positively and negatively) pairs to focus my analysis on them. To my surprise, the main variable of interest - quality - wasn’t involved in any of the strongest correlations identified. I built a few scatter plots and included a line of best fit for each of them to more clearly see the general trend in the data points. Then I added a few box plots that reinforced my earlier findings and offered some new insights.

My greatest success was finding out that alcohol content was the most influential feature that could more or less reliably be used to differentiate between poor and excellent wines. Indeed, when I later built a linear model to predict a wine grade, this feature alone accounted for about two thirds of the variance the full model explains (an R-squared of 0.227 out of 0.340).

In the final part of my analysis, I used wine grades to color-code and facet a few plots that I’d built previously to see if any variables reinforce each other across any of the wine grades. The main finding here was that in the two most strongly correlated pairs the correlation was so pronounced that the trend stayed the same across all wine grades: it was always upward for density vs residual sugar and downward for alcohol vs density. The situation was a bit different for more weakly correlated pairs: the trend did stay the same across most wine grades, but with the worst or best wines, the features I was analyzing displayed little to no correlation at all (for example, alcohol vs total SO2), which signaled these combinations were probably not the best predictors of wine quality. I tested these findings when building a linear model and excluded the worst contributors from the final version.

I’ve also bumped into a couple of obstacles along the way. First, I found out that the residual sugar distribution, when log-transformed, is bimodal across all wine grades. I’ve been struggling to explain this phenomenon for some time and even read a few specialized articles on the topic, but found no satisfactory explanation so far. So I’m inclined to believe this phenomenon is specific to Portuguese wines, since that’s what I’ve been analyzing all along.

Another thing I had difficulties with was the linear model that I’d built. It was able to explain only 34% of the variance in wine quality, which I found to be a pretty poor result. At first, I thought I was doing something wrong and actually spent a couple of days trying to engineer new features and combine them in various ways (to no avail), but then I realized that some other factors were at play and physico-chemical properties alone were not enough of a quality predictor.

And this realization leads me to suggestions on how to improve this analysis. First and foremost, more data would be nice. 1,599 wine samples is a decent start, but given the number of wines in the world, it’s just a drop in the ocean. Besides, the dataset is restricted to only Portuguese wines, which significantly limits its value and ability to represent the whole population. Second, as I mentioned above, there must be some other features that heavily influence wine quality. Better results might have been obtained if we had information about the region where a wine was produced, the year it was produced, grape type, selling price and wine brand, to name a few. Also, it might be a good idea to test other kinds of models and see how they fare against each other. I guess more powerful models, like SVM or tree-based methods, could have demonstrated better results.